MPA 5830 - Module 06

Ani Ruhil

2018-08-06


Agenda

In this module we take small baby-steps into modeling, what is also referred to as data analytics. The goal of analytics is simple: find the answer to some question(s). For example, why are low-quality diamonds more expensive? What influences the number of daily flights? To answer these types of questions we start with the usual visual explorations of the data but then eventually turn to fitting some regression-based models to the data. Let us start by understanding the basics of fitting regression models in R and then explore the flights question.

Bivariate Regression Models

The usual regression model can be specified as \(y = a + b(x)\) or as some prefer to write it, \(y = m(x) + c\). With a single independent variable (x) and a single dependent variable (y) we can fit the regression model in R as follows:
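The original code chunk is not shown, so here is a minimal sketch of the call that produces the output below, assuming the classic Galton parent-child height data that ship with the HistData package:

```r
# Fit the bivariate regression of child height on the parents' mid-height.
# Galton (capital G) is assumed to be the data frame from HistData.
library(HistData)

fit.galton <- lm(child ~ parent, data = Galton)
summary(fit.galton)
```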

## 
## Call:
## lm(formula = child ~ parent, data = Galton)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8050 -1.3661  0.0487  1.6339  5.9264 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.94153    2.81088   8.517   <2e-16 ***
## parent       0.64629    0.04114  15.711   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.239 on 926 degrees of freedom
## Multiple R-squared:  0.2105, Adjusted R-squared:  0.2096 
## F-statistic: 246.8 on 1 and 926 DF,  p-value: < 2.2e-16

Here the data represent pairs of observations, with parent referring to the parents’ mid-height and child referring to the child’s height. In this particular example, the regression equation turns out to be child's predicted height = 23.94153 + 0.64629(parent), indicating that for every 1-inch increase in the parents' mid-height, the child’s height is predicted to increase by about 0.64 inches, on average. We also focus on the p-value of the independent variable; the three asterisks we see against parent (***) tell us it is a statistically significant independent variable. As a rule of thumb we will want the variable to have at least one * against it before we treat it as significant. If we see a . (or nothing at all) then we basically treat the independent variable as having no impact on the dependent variable.

When we fit these models in MPA 6010, remember that we also focused on two other things – (i) the average prediction error (the average error you should expect in your predictions if you use this regression model) and (ii) the adjusted R-squared (how much of the variation in the dependent variable can be explained with this model). The lower the prediction error and the higher the adjusted R-squared, the better the model (loosely speaking, of course).

In R, the Residual standard error is the average prediction error, which in the present case turns out to be 2.239; our model will, on average, overpredict/underpredict a child's height by 2.239 inches. The adjusted R-squared is 0.2096, so 20.96% of the variation in a child’s height can be “explained” with this model. That obviously means we have about 79% of the variation left to explain, and that is quite a lot, so the model is not very good.

Once we fit the model we also want to generate predicted values and that can be done via the following R commands.
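The commands themselves are not shown above, so here is a sketch using predict(), again assuming the Galton data from HistData and a fitted model named fit.galton:

```r
library(HistData)

fit.galton <- lm(child ~ parent, data = Galton)

# Attach the predicted child heights to the data
Galton$predicted <- predict(fit.galton)
head(Galton)

# Plot the predicted child heights against the parents' mid-height
plot(Galton$parent, Galton$predicted,
     xlab = "Parents' mid-height", ylab = "Predicted child height")
```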

If you look at the Galton data you will now see the predicted child height for a given mid-height of the parents. We can then plot these predicted child heights.

Fair enough, but what if the independent variable were categorical? Let us see how the model works in that case, using the diamonds data. I am going to flip cut into a regular factor since it is stored as an ordered factor in the data-frame.
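Something like this should work; the diamonds data ship with ggplot2, and cut.f is the column name used in the output below:

```r
library(ggplot2)  # provides the diamonds data

# cut is stored as an ordered factor; create a regular (unordered) copy
diamonds$cut.f <- factor(diamonds$cut, ordered = FALSE)

table(diamonds$cut.f)
summary(diamonds)
```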

## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z                cut.f      
##  Min.   : 0.000   Min.   : 0.000   Fair     : 1610  
##  1st Qu.: 4.720   1st Qu.: 2.910   Good     : 4906  
##  Median : 5.710   Median : 3.530   Very Good:12082  
##  Mean   : 5.735   Mean   : 3.539   Premium  :13791  
##  3rd Qu.: 6.540   3rd Qu.: 4.040   Ideal    :21551  
##  Max.   :58.900   Max.   :31.800                    
## 
## 
## Call:
## lm(formula = price ~ cut.f, data = diamonds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4258  -2741  -1494   1360  15348 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4358.76      98.79  44.122  < 2e-16 ***
## cut.fGood       -429.89     113.85  -3.776 0.000160 ***
## cut.fVery Good  -377.00     105.16  -3.585 0.000338 ***
## cut.fPremium     225.50     104.40   2.160 0.030772 *  
## cut.fIdeal      -901.22     102.41  -8.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3964 on 53935 degrees of freedom
## Multiple R-squared:  0.01286,    Adjusted R-squared:  0.01279 
## F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 2.2e-16

How many categories of cut.f are there? Five, and the regression shows you estimates for only four because one category – here the first, Fair – serves as the reference category and is absorbed into the (Intercept). We see the average prediction error to be 3964 and the adjusted R-squared to be 0.01279 (only 1.279% of the variation in price is explained by this model, so it is not a very good model). Now we interpret the estimates.

  • If the diamond is of a Fair cut, the average predicted price is 4358.76
  • If the diamond is Good then the average predicted price is 4358.76 - 429.89 = 3928.87
  • A Very Good diamond has an average predicted price of 4358.76 - 377 = 3981.76
  • A Premium diamond has an average predicted price of 4358.76 + 225.50 = 4584.26
  • An Ideal diamond has an average predicted price of 4358.76 - 901.22 = 3457.54

Note the difference in how we interpret the results for a continuous independent variable versus a categorical one. For a continuous variable the estimate is simply how much the dependent variable changes as the independent variable increases by 1 unit. For a categorical independent variable we have to add the category's estimate to the (Intercept) to calculate the predicted value of the dependent variable for that category.

What happens if we have more than one independent variable? Well, in that case we no longer have a bivariate regression model but instead we have a multiple regression model.

Multiple Regression Models

Let us build such a model with the diamonds data-set. I’ll use two independent variables – cut.f and carat.
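The call is a straightforward extension of the bivariate case; a sketch, assuming cut.f was created as above:

```r
library(ggplot2)

diamonds$cut.f <- factor(diamonds$cut, ordered = FALSE)

# Add carat as a second independent variable
fit.diamonds <- lm(price ~ cut.f + carat, data = diamonds)
summary(fit.diamonds)
```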

## 
## Call:
## lm(formula = price ~ cut.f + carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17540.7   -791.6    -37.6    522.1  12721.4 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3875.47      40.41  -95.91   <2e-16 ***
## cut.fGood       1120.33      43.50   25.75   <2e-16 ***
## cut.fVery Good  1510.14      40.24   37.53   <2e-16 ***
## cut.fPremium    1439.08      39.87   36.10   <2e-16 ***
## cut.fIdeal      1800.92      39.34   45.77   <2e-16 ***
## carat           7871.08      13.98  563.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1511 on 53934 degrees of freedom
## Multiple R-squared:  0.8565, Adjusted R-squared:  0.8565 
## F-statistic: 6.437e+04 on 5 and 53934 DF,  p-value: < 2.2e-16

First, notice that the average prediction error has dropped to 1511 and the adjusted R-squared has gone up to 0.8565 (i.e., this model explains 85.65% of the variation in price and hence is a much better model than what we had before).

The (Intercept) is still referring to a Fair cut, but because we now have a continuous independent variable, carat, we will have to pick a value for carat before we can generate the predicted price for a Fair cut diamond. A sensible value to pick for a continuous variable is its median; in this case, we saw the median carat to be 0.7000. Now we can write the regression equation and calculate the predicted price for a Fair cut, 0.7 carat diamond as

\[predicted.price = -3875.47 + 7871.08(0.7000) = 1634.286\]

If it were a Good cut diamond of the same carat value then the predicted price would be

\[predicted.price = -3875.47 + 1120.33 + 7871.08(0.7000) = 2754.616\]

and similarly for an Ideal diamond with the same carat value,

\[predicted.price = -3875.47 + 1800.92 + 7871.08(0.7000) = 3435.206\]
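These hand calculations can be checked with predict(); a sketch, with fit.diamonds as the assumed name for the model fitted above:

```r
library(ggplot2)

diamonds$cut.f <- factor(diamonds$cut, ordered = FALSE)
fit.diamonds <- lm(price ~ cut.f + carat, data = diamonds)

# Predicted price for Fair, Good, and Ideal diamonds at the median carat
newdata <- data.frame(
  cut.f = factor(c("Fair", "Good", "Ideal"),
                 levels = levels(diamonds$cut.f)),
  carat = 0.70
)
predict(fit.diamonds, newdata = newdata)
```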

Practice Tasks

  1. Calculate the predicted price for Very Good and Premium diamonds of the same carat value
  2. Calculate predicted price for all five cuts when the carat value is at its first quartile, and then when it is at its third quartile

A more in-depth look at the diamonds data

Let us generate box-plots of price by cut
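The plot itself is not shown, but a minimal ggplot2 version would be:

```r
library(ggplot2)

# Box-plots of price by cut
p <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()
p
```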

Notice that all cuts have outliers and the median price is the highest for Fair diamonds. We can see the actual prices if we calculate the medians.

## # A tibble: 5 x 2
##   cut       Median.Price
##   <ord>            <dbl>
## 1 Fair             3282 
## 2 Good             3050.
## 3 Very Good        2648 
## 4 Premium          3185 
## 5 Ideal            1810

Why is that? Because the weight of a diamond is very critical in determining its price, and some of the lower quality diamonds tend to be the heaviest.

Notice that most diamonds are in the less than 2.5 carat range. Let us focus on just this group, which makes up some 99.7% of the data, and then repeat the geom_hex plot.

This plot shows a non-linear relationship between carat and price, a relationship that is easy to see if we draw a line through what seems to be the middle of the distribution.

Notice the line is not straight. For linear models, the relationship should be linear and one way of achieving that is by transforming the variables. There are several ways to do this transformation but taking the logarithm is the easiest way of doing so. We carry out the transformation and then plot again.
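A sketch of the transformation and plot; diamonds2, log.price, and log.carat are assumed names matching the model output below (natural logs, with the data restricted to diamonds of at most 2.5 carats):

```r
library(ggplot2)
library(dplyr)

# Keep diamonds of at most 2.5 carats (about 99.7% of the data) and take
# natural logs of price and carat
diamonds2 <- diamonds %>%
  filter(carat <= 2.5) %>%
  mutate(log.price = log(price),
         log.carat = log(carat))

# Rendering geom_hex requires the hexbin package
ggplot(diamonds2, aes(x = log.carat, y = log.price)) +
  geom_hex(bins = 50)
```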

Much better, and now we can fit a regression model to the data.

## 
## Call:
## lm(formula = log.price ~ log.carat, data = diamonds2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36139 -0.17016 -0.00585  0.16587  1.34114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.452142   0.001364  6194.5   <2e-16 ***
## log.carat   1.681371   0.001936   868.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2611 on 53812 degrees of freedom
## Multiple R-squared:  0.9334, Adjusted R-squared:  0.9334 
## F-statistic: 7.542e+05 on 1 and 53812 DF,  p-value: < 2.2e-16

If we wanted to, we could also throw in the other variables to see if we can boost the model’s performance.

## 
## Call:
## lm(formula = log.price ~ log.carat + cut.f + color.f + clarity.f, 
##     data = diamonds2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81367 -0.08621 -0.00065  0.08263  1.92918 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.878793   0.005824 1352.76   <2e-16 ***
## log.carat       1.886239   0.001124 1677.81   <2e-16 ***
## cut.fGood       0.079667   0.003879   20.54   <2e-16 ***
## cut.fVery Good  0.116721   0.003609   32.34   <2e-16 ***
## cut.fPremium    0.139116   0.003570   38.97   <2e-16 ***
## cut.fIdeal      0.160826   0.003539   45.45   <2e-16 ***
## color.fE       -0.054404   0.002103  -25.87   <2e-16 ***
## color.fF       -0.095218   0.002126  -44.78   <2e-16 ***
## color.fG       -0.160878   0.002082  -77.29   <2e-16 ***
## color.fH       -0.250963   0.002210 -113.56   <2e-16 ***
## color.fI       -0.372398   0.002477 -150.36   <2e-16 ***
## color.fJ       -0.511213   0.003061 -166.99   <2e-16 ***
## clarity.fSI2    0.408518   0.005248   77.84   <2e-16 ***
## clarity.fSI1    0.572443   0.005216  109.75   <2e-16 ***
## clarity.fVS2    0.721966   0.005243  137.71   <2e-16 ***
## clarity.fVS1    0.792239   0.005319  148.94   <2e-16 ***
## clarity.fVVS2   0.927682   0.005475  169.45   <2e-16 ***
## clarity.fVVS1   0.999500   0.005626  177.65   <2e-16 ***
## clarity.fIF     1.094550   0.006071  180.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1328 on 53795 degrees of freedom
## Multiple R-squared:  0.9828, Adjusted R-squared:  0.9828 
## F-statistic: 1.706e+05 on 18 and 53795 DF,  p-value: < 2.2e-16

Notice the adjusted R-squared is at 0.9828 so this model is explaining some 98.28% of the variation in price, leaving very little variation in price to be explained.

What influences the number of daily flights?

We can engage this question with the NYC flights data found in the nycflights13 package.

We start by calculating the number of flights on each day of the year and then plotting this frequency.
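The counting step can be sketched as follows (daily is the assumed name for the resulting data frame):

```r
library(dplyr)
library(lubridate)
library(nycflights13)

# Build a proper date from the year, month, and day columns, then count
# the number of flights per date
daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date)

daily
```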

## # A tibble: 365 x 2
##    date           n
##    <date>     <int>
##  1 2013-01-01   842
##  2 2013-01-02   943
##  3 2013-01-03   914
##  4 2013-01-04   915
##  5 2013-01-05   720
##  6 2013-01-06   832
##  7 2013-01-07   933
##  8 2013-01-08   899
##  9 2013-01-09   902
## 10 2013-01-10   932
## # ... with 355 more rows

Hmm, this is not an easy plot to read to pick up on any marked patterns, but what if we broke it out by the day of the week? That might help us see whether the number of flights varies by what day of the week it is. We can do this via box-plots.

So Saturdays have the fewest flights, followed by Sundays. This is not surprising: most flights are for business purposes, and so you are more likely to leave on a Sunday for a Monday meeting than to travel on a Saturday, which is why Saturdays have the fewest flights on average.

Let us now fit a regression model to these data, trying to predict the number of flights from the day of the week. Note that the (Intercept) is referring to Sundays.
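A sketch of the fit; wday.f is built as a regular (unordered) factor so that Sunday, the first level, becomes the reference category absorbed into the (Intercept). The abbreviated day labels assume an English locale:

```r
library(dplyr)
library(lubridate)
library(nycflights13)

daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date) %>%
  mutate(wday.f = factor(wday(date, label = TRUE), ordered = FALSE))

fit.wday <- lm(n ~ wday.f, data = daily)
summary(fit.wday)
```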

## 
## Call:
## lm(formula = n ~ wday.f, data = daily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -331.75   -8.69   12.31   25.19  112.38 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  891.481      6.767 131.742  < 2e-16 ***
## wday.fMon     83.327      9.570   8.707  < 2e-16 ***
## wday.fTue     59.878      9.525   6.287 9.44e-10 ***
## wday.fWed     71.212      9.570   7.441 7.48e-13 ***
## wday.fThu     74.269      9.570   7.761 8.92e-14 ***
## wday.fFri     75.981      9.570   7.940 2.64e-14 ***
## wday.fSat   -146.865      9.570 -15.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.8 on 358 degrees of freedom
## Multiple R-squared:  0.7178, Adjusted R-squared:  0.7131 
## F-statistic: 151.8 on 6 and 358 DF,  p-value: < 2.2e-16

If a regression model is capturing most of the variation in the dependent variable, the residuals should be essentially pure noise, i.e., reflecting no systematic pattern whatsoever. What is the residual? You fit a model, generate predicted values of the dependent variable, and then if you deduct the predicted value \((\hat{y})\) from the actual value \((y)\) you will have the residuals. Let us do this next.

## # A tibble: 7 x 2
##   wday.f     n
##   <fct>  <dbl>
## 1 Sun     891.
## 2 Mon     975.
## 3 Tue     951.
## 4 Wed     963.
## 5 Thu     966.
## 6 Fri     967.
## 7 Sat     745.

The grid is a simple table with predicted number of flights for each day of the week. The plot we have drawn shows you the actual number of flights per day of the week through the year as box-plots, and the predicted number of flights as the red dot, one for each day, pulled from the values in grid.
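The grid can be built with modelr's data_grid() and add_predictions(); a sketch, with daily and fit.wday as the assumed names from earlier:

```r
library(dplyr)
library(lubridate)
library(modelr)
library(nycflights13)

daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date) %>%
  mutate(wday.f = factor(wday(date, label = TRUE), ordered = FALSE))

fit.wday <- lm(n ~ wday.f, data = daily)

# One row per day of the week, with the model's predicted flight count
grid <- daily %>%
  data_grid(wday.f) %>%
  add_predictions(fit.wday, var = "n")
grid
```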

Now we calculate and plot the residuals.
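A sketch using modelr's add_residuals() and geom_ref_line(), with the setup assumed from earlier:

```r
library(dplyr)
library(ggplot2)
library(lubridate)
library(modelr)
library(nycflights13)

daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date) %>%
  mutate(wday.f = factor(wday(date, label = TRUE), ordered = FALSE))

fit.wday <- lm(n ~ wday.f, data = daily)

# Residual = actual n minus the model's predicted n
daily <- daily %>% add_residuals(fit.wday, var = "resid")

ggplot(daily, aes(x = date, y = resid)) +
  geom_ref_line(h = 0) +
  geom_line()
```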

Notice the geom_ref_line(h = 0) drawn where the residual is 0, which would be the case if the predicted and actual number of flights were identical (since then \(y - \hat{y} = 0\)). Negative residuals occur when we overpredict (since then \(\hat{y}\) exceeds \(y\)), and positive residuals when we underpredict (since then \(y\) exceeds \(\hat{y}\)). These residuals are important because they represent the remaining patterns in the data, patterns that are not due to the day of the week since that effect was already modeled in the regression.

Why are we seeing huge residuals on certain dates and not others? Let us see what results if we break this plot down by the day of the week.

What you are looking for is whether the residuals are generally close to the zero reference line across the year or if they tend to cross above or below zero for long periods. The one day that shows this deviation is Saturday.

Note, before we move on, that the largest residuals are all negative and seem to occur in June/July, September/October, and November/December. Which dates have the largest residuals? We can find out quite easily.
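Filtering for residuals below -100 recovers the dates shown below; a sketch, with daily and its resid column assumed from earlier:

```r
library(dplyr)
library(lubridate)
library(modelr)
library(nycflights13)

daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date) %>%
  mutate(wday.f = factor(wday(date, label = TRUE), ordered = FALSE))

fit.wday <- lm(n ~ wday.f, data = daily)
daily <- daily %>% add_residuals(fit.wday, var = "resid")

# Days where the model overpredicts by more than 100 flights
daily %>%
  filter(resid < -100) %>%
  arrange(resid)
```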

## # A tibble: 11 x 5
##    date           n wday  wday.f resid
##    <date>     <int> <ord> <fct>  <dbl>
##  1 2013-11-28   634 Thu   Thu    -332.
##  2 2013-11-29   661 Fri   Fri    -306.
##  3 2013-12-25   719 Wed   Wed    -244.
##  4 2013-07-04   737 Thu   Thu    -229.
##  5 2013-12-24   761 Tue   Tue    -190.
##  6 2013-12-31   776 Tue   Tue    -175.
##  7 2013-09-01   718 Sun   Sun    -173.
##  8 2013-05-26   729 Sun   Sun    -162.
##  9 2013-07-05   822 Fri   Fri    -145.
## 10 2013-01-01   842 Tue   Tue    -109.
## 11 2013-01-20   786 Sun   Sun    -105.

So November 28 and 29, followed by December 25, the Fourth of July, and so on. Why are the residuals so large and negative? Because our model does not include flags for holidays or other special occasions, and hence it ends up predicting far more flights on these days than the actual number of flights on these days.

We can dig into the mystery of poor predictions for Saturday flights by focusing purely on this day of the week. As we do this we can plot the actual number of flights across the year to see the actual variation in the number of daily flights.
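A sketch of the Saturday-only plot, with daily and wday.f assumed from earlier:

```r
library(dplyr)
library(ggplot2)
library(lubridate)
library(nycflights13)

daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date) %>%
  mutate(wday.f = factor(wday(date, label = TRUE), ordered = FALSE))

# Actual number of Saturday flights across the year, with monthly breaks
# and abbreviated month names on the x-axis
daily %>%
  filter(wday.f == "Sat") %>%
  ggplot(aes(x = date, y = n)) +
  geom_point() +
  geom_line() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b")
```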

Note the scale_x_date() command; it allows us to create breaks by month and to have abbreviated month names show up on the x-axis. Now, if the number of flights varied randomly we would see very little pattern in the data, but instead it is obvious that Saturday flights increase from February through April, then dip, start back up again through the middle of August, and rise again towards the end of November through December. Some of these increases occur during the holiday season, and others in the summer months when vacation travel rises. This would become more obvious if we created indicators for school terms.
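A school-term indicator can be created by cutting the year into spring, summer, and fall; the break dates below are assumptions, roughly tracking the US school calendar. Coloring the Saturday plot by term makes the pattern easier to see:

```r
library(dplyr)
library(ggplot2)
library(lubridate)
library(nycflights13)

daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date) %>%
  mutate(wday.f = factor(wday(date, label = TRUE), ordered = FALSE))

# Break dates are assumptions, not part of the original notes
term <- function(date) {
  cut(date,
      breaks = ymd(20130101, 20130605, 20130825, 20140101),
      labels = c("spring", "summer", "fall"))
}

daily <- daily %>% mutate(term = term(date))

daily %>%
  filter(wday.f == "Sat") %>%
  ggplot(aes(x = date, y = n, color = term)) +
  geom_point() +
  geom_line() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b")
```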

Aha! Notice the Summer increase. But do flights increase for all days of the week in the Summer? To answer that question we would have to plot daily flights by day of the week, color by term, and use a box-plot.

If term had no impact, the three box-plots for a given day of the week would be very similar, but that is not the case. The box-plots are least symmetric for Spring, and Fall has several outliers. It might be worthwhile to include term as an independent variable in the regression model and see if that helps generate a better model.
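The two models can be fit and compared side by side; a sketch, with daily, wday.f, and the term() break dates carried over from the assumed setup above:

```r
library(dplyr)
library(lubridate)
library(nycflights13)

daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date) %>%
  mutate(wday.f = factor(wday(date, label = TRUE), ordered = FALSE))

# School-term indicator; the break dates are assumptions
term <- function(date) {
  cut(date,
      breaks = ymd(20130101, 20130605, 20130825, 20140101),
      labels = c("spring", "summer", "fall"))
}
daily <- daily %>% mutate(term = term(date))

fit.wday <- lm(n ~ wday.f, data = daily)
fit.wday.term <- lm(n ~ wday.f + term, data = daily)

summary(fit.wday)
summary(fit.wday.term)
```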

## 
## Call:
## lm(formula = n ~ factor(wday, ordered = FALSE), data = daily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -331.75   -8.69   12.31   25.19  112.38 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       891.481      6.767 131.742  < 2e-16 ***
## factor(wday, ordered = FALSE)Mon   83.327      9.570   8.707  < 2e-16 ***
## factor(wday, ordered = FALSE)Tue   59.878      9.525   6.287 9.44e-10 ***
## factor(wday, ordered = FALSE)Wed   71.212      9.570   7.441 7.48e-13 ***
## factor(wday, ordered = FALSE)Thu   74.269      9.570   7.761 8.92e-14 ***
## factor(wday, ordered = FALSE)Fri   75.981      9.570   7.940 2.64e-14 ***
## factor(wday, ordered = FALSE)Sat -146.865      9.570 -15.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.8 on 358 degrees of freedom
## Multiple R-squared:  0.7178, Adjusted R-squared:  0.7131 
## F-statistic: 151.8 on 6 and 358 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = n ~ factor(wday, ordered = FALSE) + factor(term, 
##     ordered = FALSE), data = daily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -325.61   -6.57   12.35   23.68  118.52 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          882.087      7.049 125.138  < 2e-16
## factor(wday, ordered = FALSE)Mon      83.327      9.122   9.135  < 2e-16
## factor(wday, ordered = FALSE)Tue      60.055      9.079   6.615 1.37e-10
## factor(wday, ordered = FALSE)Wed      70.562      9.123   7.735 1.08e-13
## factor(wday, ordered = FALSE)Thu      73.620      9.123   8.070 1.09e-14
## factor(wday, ordered = FALSE)Fri      75.332      9.123   8.257 2.95e-15
## factor(wday, ordered = FALSE)Sat    -147.515      9.123 -16.169  < 2e-16
## factor(term, ordered = FALSE)summer   37.663      6.378   5.905 8.25e-09
## factor(term, ordered = FALSE)fall      3.903      5.544   0.704    0.482
##                                        
## (Intercept)                         ***
## factor(wday, ordered = FALSE)Mon    ***
## factor(wday, ordered = FALSE)Tue    ***
## factor(wday, ordered = FALSE)Wed    ***
## factor(wday, ordered = FALSE)Thu    ***
## factor(wday, ordered = FALSE)Fri    ***
## factor(wday, ordered = FALSE)Sat    ***
## factor(term, ordered = FALSE)summer ***
## factor(term, ordered = FALSE)fall      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.51 on 356 degrees of freedom
## Multiple R-squared:  0.745,  Adjusted R-squared:  0.7393 
## F-statistic:   130 on 8 and 356 DF,  p-value: < 2.2e-16

We do see a slight improvement in the adjusted R-squared from 0.7131 to 0.7393. We could stop here or keep digging into the data but doing so will require us to rely on more complicated models that are beyond the scope of this class. Nevertheless, we have done well in terms of analyzing these data and identifying two things –

  1. The day of the week has an impact on the number of daily flights, and
  2. The term (or school-season if you will) also appears to influence the number of daily flights.

So if you were told what day of the week it is and what term it is, you would be able to generate a reasonable prediction about how many flights should be expected on that given day.

Practice Tasks

  1. Look at the other dates with large negative residuals. Can you figure out what might have been happening on Jan 20, May 26, and Sep 1?
  2. If instead of term you used the month as an independent variable in the model, would you end up with a better model, i.e., one with an average prediction error smaller than 48.8 and an adjusted R-squared higher than 0.7131? Show your work.